The molecular kinetics community often builds discrete-state models of continuous-state time-series, called Markov State Models (MSMs).
How do we compute the likelihood of a time-series given an MSM?
Let's say we want to compute the likelihood of a vector-valued sequence given an MSM. The MSM is defined by (1) a partitioning of the entire continuous space into $n$ non-overlapping chunks, and (2) an $n \times n$ transition matrix $T$, where each entry $T_{ij}$ is the probability that you'll be anywhere in chunk $j$ if you start anywhere in chunk $i$ and wait a while.
The likelihood of a discrete sequence $S = (s_1, s_2, \dots, s_N)$ is easy; it's just:
$$p(S) = p(s_1) \prod_{t=2}^N p(s_t | s_{t-1}) = p(s_1) \prod_{t=2}^N T_{s_{t-1} s_t}$$This is what's currently done in MSMBuilder when you ask for a likelihood, for example: https://github.com/msmbuilder/msmbuilder/blob/master/msmbuilder/msm/msm.py#L318
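For concreteness, here's a minimal sketch of that discrete-sequence log-likelihood in numpy. This is just an illustration of the formula above, not MSMBuilder's implementation; the function name and arguments are made up:
import numpy as np

def discrete_log_likelihood(sequence, T, initial_distribution):
    # log p(S) = log p(s_1) + sum_t log T[s_{t-1}, s_t]
    log_prob = np.log(initial_distribution[sequence[0]])
    for t in range(1, len(sequence)):
        log_prob += np.log(T[sequence[t - 1], sequence[t]])
    return log_prob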
But in the case of MSMs, this obviously doesn't tell the whole story. You could, for instance, choose a totally useless state-space partitioning that maps any possible vector to the same discrete state $(n=1, T = T_{11}=1.0)$, but evaluating this likelihood function would tell you that the probability of the sequence is $1.0$.
What we really want is the likelihood of a vector-valued sequence, $X = ( \vec{x}_1, \vec{x}_2, \dots, \vec{x}_N)$:
$$p(X) = p(\vec{x}_1) \prod_{t = 2}^N p(\vec{x}_t | \vec{x}_{t-1})$$Evaluating this from an MSM requires one more step. The probability density at any individual point in a partition is not equal to the probability of the whole partition; instead, the density has to integrate to that value over the partition. Consequently, the transition probability $p(\vec{x}_t | \vec{x}_{t-1})$ isn't simply equal to the corresponding element of $T$. $T_{ij}$ is actually an integral:
$$T_{ij} = \int_{s_j} p(\vec{y} | \vec{x})d\vec{y}$$where the "origin point" is anywhere in the partition $i$, $\vec{x} \in s_i$, and we integrate over all "destination points" in the partition $j$, $\vec{y} \in s_j$.
Since the transition density in an MSM is uniform within each partition, this integral should be easy. All we need is the volume $V_j$ of each partition:
$$\begin{aligned}T_{ij} & = V_j p(\vec{y} | \vec{x})\\ p(\vec{y} | \vec{x}) & = T_{ij}/{V_j}\end{aligned}$$for any $\vec{x} \in s_i$, $\vec{y} \in s_j$.
So now we can compute the vector-to-vector transition probability from quantities that the MSM defines: the volume of each partition, and the transition matrix.
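As a quick sanity check on the units: if $T_{ij} = 0.3$ and partition $j$ has volume $V_j = 2$, then the transition density is $p(\vec{y} | \vec{x}) = T_{ij}/V_j = 0.15$ for any $\vec{x} \in s_i$, $\vec{y} \in s_j$, and integrating that constant density over $s_j$ recovers $0.15 \times 2 = 0.3 = T_{ij}$.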
In [1]:
# basic imports
import numpy as np
import numpy.random as npr
npr.seed(0)
import matplotlib.pyplot as plt
%matplotlib inline
In [2]:
# construct a 2-state Markov model, and sample a long trajectory
# the 2 states A and B are Gaussian blobs in 2D
A = (0,0)
B = (5,5)
centers = dict()
centers[0] = A
centers[1] = B
# the transition matrix is symmetric and metastable
trans_mat = np.array([[0.7,0.3],
                      [0.3,0.7]])
# sample a trajectory
traj_len=100000
traj=npr.randn(traj_len,2)
state = 0
for i in range(1,traj_len):
    state = int(npr.rand() > trans_mat[state][0]) # only works for 2-state MSM
    traj[i] += centers[state]
X = traj[:int(traj_len*0.9)]
X_test = traj[int(traj_len*0.9):]
In [3]:
plt.imshow(trans_mat,interpolation='none',cmap='Blues');
plt.xticks([0,1])#,[r'$\mathcal{N}((0,0),1)$',r'$\mathcal{N}((5,5),1)$'])
plt.yticks([0,1])#,[r'$\mathcal{N}((0,0),1)$',r'$\mathcal{N}((5,5),1)$'])
plt.colorbar()
plt.title('True transition matrix')
Out[3]:
In [4]:
# Use a library function for computing a state decomposition.
# In general we don't know the correct number of states beforehand, so let's over-estimate here:
from msmbuilder.cluster import MiniBatchKMedoids
kmed = MiniBatchKMedoids(n_clusters=8)
kmed.fit([X])
Out[4]:
In [5]:
# assign each point in the trajectory to a partition
clusters = kmed.transform([X])[0]
In [6]:
clusters
Out[6]:
In [7]:
plt.scatter(X[:,0],X[:,1],c=clusters, linewidths=0)
plt.title('All points in trajectory, colored by cluster')
Out[7]:
In [8]:
# uniformly sample points in the box [-5, 10]^2; these will be used to estimate partition volumes by Monte Carlo
uniform = npr.rand(100000,2)*15-5
uniform_clusters = kmed.transform([uniform])[0]
In [9]:
plt.scatter(uniform[:,0],uniform[:,1],c=uniform_clusters, linewidths=0)
plt.xlim(-5,10)
plt.ylim(-5,10)
plt.colorbar()
Out[9]:
In [10]:
plt.hist(uniform_clusters,bins=len(set(uniform_clusters)),normed=True);
plt.xlabel('Cluster ID')
plt.ylabel('Fractional volume')
Out[10]:
In [11]:
from msmbuilder.msm import MarkovStateModel
In [12]:
msm = MarkovStateModel()
msm.fit(clusters)
Out[12]:
In [13]:
clusters
Out[13]:
In [14]:
msm.score_ll([clusters])
Out[14]:
In [15]:
msm.score_ll(kmed.transform([X_test]))
Out[15]:
In [16]:
T = msm.transmat_
In [17]:
# T is row-normalized
T[0],np.sum(T[0])
Out[17]:
In [18]:
# not column-normalized
np.sum(T[:,0])
Out[18]:
In [19]:
# let's estimate the volume of each partition by simple Monte Carlo:
# (fraction of uniform samples assigned to each partition) * (total volume of the sampled box)
def compute_cluster_volumes(total_volume,uniform_cluster_assignments):
    counts_per_cluster = np.zeros(uniform_cluster_assignments.max()+1)
    for x in uniform_cluster_assignments:
        counts_per_cluster[x] += 1.0
    fraction_in_cluster = counts_per_cluster / len(uniform_cluster_assignments)
    volumes = fraction_in_cluster * total_volume
    return volumes
total_volume = (uniform.max() - uniform.min())**(uniform.shape[1])
volumes = compute_cluster_volumes(total_volume,uniform_clusters)
In [20]:
T[0],volumes
Out[20]:
In [21]:
k = 1.0 / volumes
k*T[0]
Out[21]:
In [22]:
plt.imshow(T + T.T,interpolation='none',cmap='Blues');
plt.title("Transition matrix")
Out[22]:
In [23]:
plt.bar(range(len(k)),k)
plt.figure()
plt.bar(range(len(k)),volumes)
Out[23]:
In [24]:
# now let's score a trajectory
In [25]:
# prior probability *density* vector $\pi$
pi = msm.populations_*k
pi
Out[25]:
In [26]:
msm.populations_
Out[26]:
In [27]:
# log probabilities
log_pi = np.log(pi)
log_normalized_T = np.log(k*T)
In [28]:
def score_correctly_ll(sequence,log_pi,log_normalized_T):
    log_prob = log_pi[sequence[0]]
    for i in range(1,len(sequence)):
        log_prob += log_normalized_T[sequence[i-1],sequence[i]]
    return log_prob
In [29]:
score_correctly_ll(clusters,log_pi,log_normalized_T)
Out[29]:
In [30]:
score_correctly_ll(kmed.transform([X_test])[0],log_pi,log_normalized_T)
Out[30]:
In [31]:
# sanity check: what if we have a 1-state MSM?
# in MSMBuilder, we would first perform a clustering, then pass the sequence of cluster labels to be scored:
one_state_clusters = [np.zeros(len(clusters))]
msm = MarkovStateModel()
msm.fit(one_state_clusters)
print('MSMBuilder log likelihood: {0}'.format(msm.score_ll(one_state_clusters)))
# this leads to a probability of 1.0 for any observation sequence, even though the model contains no information
# for the corrected log_likelihood, we account for the volume of the cluster
log_pi = np.log(1.0/total_volume)
print('Corrected log likelihood: {0:.3f}'.format(log_pi*len(clusters)))
# that's thankfully a bit lower than the multi-state MSM log-likelihood-- sanity check passed!
In [32]:
MarkovStateModel().fit([np.zeros(1000)]).score_ll([np.zeros(1000)])
Out[32]:
In [33]:
from msmbuilder.example_datasets import FsPeptide
In [34]:
fs = FsPeptide().get().trajectories
In [35]:
from msmbuilder.featurizer import DihedralFeaturizer
In [36]:
dih = DihedralFeaturizer(['phi', 'psi', 'omega', 'chi1', 'chi2', 'chi3', 'chi4']).fit_transform(fs)
In [37]:
dih[0].shape
Out[37]:
In [38]:
np.vstack(dih).max(),np.vstack(dih).min()
Out[38]:
In [39]:
kmed_fs = MiniBatchKMedoids(n_clusters=10)
kmed_fs.fit(dih)
Out[39]:
In [40]:
clusters_fs = kmed_fs.transform(dih)
In [41]:
# simple monte carlo: uniformly throw darts in a 148-dimensional cube
uniform_fs = np.array(npr.rand(1000000,dih[0].shape[1]),dtype='float32')*2 - 1
uniform_clusters_fs = kmed_fs.transform([uniform_fs])[0]
In [42]:
frac_volume = np.array([np.sum(uniform_clusters_fs==i) for i in range(10)],dtype=float) / len(uniform_clusters_fs)
In [43]:
plt.bar(range(10),frac_volume);
plt.xlabel('Cluster ID')
plt.ylabel('Fractional volume')
Out[43]:
In [44]:
occupancy = np.array([np.sum(np.vstack(clusters_fs)==i) for i in range(10)],dtype=float)
frac_occupancy = occupancy / np.sum(occupancy)
plt.bar(range(10),frac_occupancy);
plt.xlabel('Cluster ID')
plt.ylabel('Fractional occupancy')
Out[44]:
In [45]:
point_density = frac_occupancy / frac_volume
plt.bar(range(10),point_density / np.min(point_density));
plt.xlabel('Cluster ID')
plt.ylabel('Relative point density')
Out[45]:
In [46]:
# in this case, bigger clusters are not necessarily more populous, with some clusters up to ~40x more dense with observed points
# than others
In [47]:
uniform_clusters_fs
Out[47]:
In [48]:
# total volume: yikes! that's large!
total_volume_fs = 2.0**148
total_volume_fs
Out[48]:
In [49]:
# Now we can score this MSM the same way as above...
volumes_fs = compute_cluster_volumes(total_volume_fs,uniform_clusters_fs)
In [50]:
volumes_fs
Out[50]:
In [51]:
msm_fs = MarkovStateModel(prior_counts=1)
msm_fs.fit(clusters_fs)
Out[51]:
In [52]:
def compute_normalized_log_probs_from_msm(msm,volumes):
    k = 1.0/volumes
    pi = msm.populations_*k
    log_pi = np.log(pi)
    T = msm.transmat_
    log_normalized_T = np.log(k*T)
    return log_pi,log_normalized_T
In [53]:
log_pi_fs,log_normalized_T_fs = compute_normalized_log_probs_from_msm(msm_fs,volumes_fs)
In [54]:
lls = []
for traj in clusters_fs:
    lls.append(score_correctly_ll(traj,log_pi_fs,log_normalized_T_fs))
In [55]:
np.sum(lls)
Out[55]:
Another proposed likelihood function: write $$\mathcal{L}(D | T, S) = \left[\prod_{n=1}^{N} T_{s_{n-1} s_n}\right] \left[\prod_{n=1}^{N} \frac{e^{-U(x_n)}}{\pi_{s_n}}\right] \left[\prod_{n=1}^{N} \pi_{s_n}\right]$$